enhancement(metrics): support for metrics v3 protocol#1175
Conversation
Binary Size Analysis (Agent Data Plane)Baseline: 08cd4c4 · Comparison: 53976e1 · diff ✅ Binary size difference within thresholdChanges by Module
Detailed Symbol Changes |
Regression Detector (Agent Data Plane)Run ID: Optimization Goals: ✅ No significant changes detectedFine details of change detection per experiment (35)Experiments configured
Bounds Checks: ✅ Passed (5)
ExplanationA change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression ( |
|
This is temporarily blocked on there being a version of the Datadog Agent for us to test against in correctness tests that has up-to-date v3 metrics support. Currently, we're hitting an issue related to rate intervals being delta encoded when they shouldn't be. That bug is fixed in DataDog/datadog-agent#45825 but won't be released until 7.77: roughly 2 weeks from now before an RC is available to use. We can potentially do a hacky image build or something for keep going in the meantime and then switch back to a proper Agent version once available, we'll see. |
79cdda1 to
59636cd
Compare
|
We've temporarily handled the issue of correctness tests by using a "dev" container image ( We can't merge this as-is: we need to wait for at least an RC build of Datadog Agent 7.77 so we can pin to a non-development image. In the meantime, I'm going to work on making sure we've integrated all of the same small fixes/changes that have been steadily being made upstream in the Datadog Agent repository for V3 support. |
30ee642 to
898021d
Compare
be9a81c to
a9f5109
Compare
a9f5109 to
31b5f82
Compare
| }); | ||
| let v3_flushed = if let Some(v3_metrics) = maybe_v3_metrics { | ||
| if v2_flushed || v3_metrics.len() >= v3_endpoint_config.max_metrics_per_payload() { | ||
| encode_and_flush_v3_metrics(endpoint, &v3_endpoint_config, v3_metrics, &telemetry, &mut payloads_tx, batch_id.as_ref(), v3_payload_info).await?; |
There was a problem hiding this comment.
This doesn't seem to observe any intake payload size limits, or am I missing anything?
There was a problem hiding this comment.
That's correct.
Right now, we're either flushing with the V2 encoder determines it needs to flush (so that we generate an equivalent payload in terms of the contained metrics between the two) or if we exceed the configured maximum metrics per payload limit.
In V2/V3 mode, I suppose it's entirely possible to have the V3 payload exceed the payload limits, although it would be incredibly unlikely. In V3 only mode, it's obviously a much more likely risk.
My thought process was that we would improve this -- make V3 encoding aware of the payload limits -- at the same time we added incremental compression to match the behavior of the Agent... since back when this was originally written many weeks ago, it seemed like we'd have enough time between then and "V3 only in production / for customers" to do the follow-up work.
I guess the question I have is: do you feel like we still have that sort of time before we want to be running V3 only?
| Ok(encoded) => { | ||
| match create_v3_request("/api/intake/metrics/v3/series", encoded, ep_config.compression_scheme()).await { | ||
| Ok(request) => { | ||
| flush_payload(request, events, payloads_tx, batch_id, 0, 1, payload_info).await?; |
There was a problem hiding this comment.
Is it intentional that batch_seq and batch_len are hard-coded as 0 and 1?
There was a problem hiding this comment.
It is intentional, but only in the context of us not currently splitting V3 payloads: there can literally only ever be a single V3 payload in each batch.
(Mostly related to the question you left about not obeying intake payload size limits.)
There was a problem hiding this comment.
That doesn't sound right: there can be multiple v2 payloads constructed from even a single event buffer, so even if they mach one to one, there should be multiple v3 ones. Or are we using word batch differently here?
There was a problem hiding this comment.
There's probably some missing context/mismatched terminology here, yeah.
Every time we receive an "event buffer", we process all of the metrics in the event buffer, incrementally encoding them via the request builder. As we're doing that, we might reach the point where a payload is "full" (aka adding another metric would cause the final payload to exceed the (un)compressed size limits) and we have to flush it before encoding the next metric in the event buffer. That scenario is what I'm referring to here.
Unlike the Core Agent, we don't immediately emit partial payloads for the remainder of an event buffer: if we finish processing an event buffer, and there's still remaining metrics in the request builder, we wait for a period of time for additional event buffers to come in and eventually time out and flush that partial payload.
ca5ebc6 to
16afa2e
Compare
|
f27ed86 to
f4967da
Compare
f4967da to
d804b60
Compare
Adds experimental V3 columnar encoding for series and sketch metrics behind the serializer_experimental_use_v3_api.* config keys, including a V2/V3 validation mode, V3 intake routing/filtering, intake-side V3 payload parsing for correctness testing, and the dsd-plain-v3 correctness cases. Existing V1/V2 series and sketch encoding is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
348f227 to
d203dbf
Compare
## Summary <!-- Please provide a brief summary about what this PR does. This should help the reviewers give feedback faster and with higher quality. --> Adds V3 series shadow sampling support to ADP, matching the Core Agent config/defaults for: - `serializer_experimental_use_v3_api.series.shadow_sample_rate` - `serializer_experimental_use_v3_api.series.shadow_sites` - `serializer_experimental_use_v3_api.series.beta_route` When series V3 is not authoritative, ADP can now sample V2 series flushes and send a correlated V3 beta shadow payload with the same metrics validation batch headers. Shadowing is limited to V2 series baselines, matching the Core Agent behavior. Also fixes the V3 correctness harness so the new `dsd-plain-v3` cases decode and compare V3 payloads correctly. Fake intake now handles V3 metric routes, and `stele` normalizes V3 columnar payloads, including host resources and Agent-compatible sketch summary ordering. Todo / Follow Up: Shadow sampling is currently encoder-scoped in ADP. A follow-up should make it endpoint/resolver-scoped like the Core Agent for mixed endpoint and multi-site configurations. ## Change Type - [ ] Bug fix - [x] New feature - [ ] Non-functional (chore, refactoring, docs) - [ ] Performance ## How did you test this PR? <!-- Please how you tested these changes here --> Unit Tests / CI ## References <!-- Please list any issues closed by this PR. --> <!-- - Closes: <issue link> --> <!-- Any other issues or PRs relevant to this PR? Feel free to list them here. -->

Summary
Change Type
How did you test this PR?
References